The average channel among the 1000 most-subscribed YouTube channels is about six years old.

VidStatsX treats YouTube's automatically generated channels, like #Music, as real channels, and many of them have enough subscribers to be in the top 1000. Since the YouTube API (and probably most people) does not consider them real channels, they are not included here. Without them, the dataset contains only 964 channels.


In [5]:
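def median(l):
    """Return the median of a list of numbers as a float."""
    l = sorted(l)  # work on a sorted copy
    mid = len(l) // 2  # explicit integer division, same result on Python 2 and 3
    if len(l) % 2 == 1:  # odd number of items: take the middle one
        return float(l[mid])
    else:  # even number of items: average the two middle ones
        return float(l[mid] + l[mid - 1]) / 2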

average_age = sum(top_users_ages)/len(top_users_ages)
median_age = median(top_users_ages)
print("Average (days): " + str(average_age) + "; Median: " + str(median_age))
print("Average (years): " + str(average_age/365.0) + "; Median: " + str(median_age/365.0))
print("Number of channels: " + str(len(top_users_ages)))


Average (days): 2138; Median: 2085.5
Average (years): 5.85753424658; Median: 5.71369863014
Number of channels: 964
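As a quick cross-check on the hand-written `median` helper, the standard library's `statistics.median` (Python 3.4 and later) handles the same two cases: an odd-length list returns its middle value, and an even-length list returns the mean of the two middle values. The sample lists below are made up for illustration and are not part of the dataset.

```python
import statistics  # standard library, Python 3.4+

odd_sample = [3, 1, 2]       # sorted: [1, 2, 3], middle value is 2
even_sample = [4, 1, 3, 2]   # sorted: [1, 2, 3, 4], middle values are 2 and 3

print(statistics.median(odd_sample))   # 2
print(statistics.median(even_sample))  # 2.5
```

The helper in the cell above should agree with these values, apart from always returning a float.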

Educational channels are somewhat older than the top 1000 overall: about five months older on average and about seven months older at the median. Gaming channels are only about two months younger on average, but the median gaming channel is roughly seven and a half months younger than the median channel overall.


In [6]:
edu_average_age = sum(edu_ages)/len(edu_ages)
edu_median_age = median(edu_ages)

print("Educational Average: " + str(edu_average_age/365.0) + "; Median: " + str(edu_median_age/365.0))

gaming_average_age = sum(gaming_ages)/len(gaming_ages)
gaming_median_age = median(gaming_ages)

print("Gaming Average: " + str(gaming_average_age/365.0) + "; Median: " + str(gaming_median_age/365.0))


Educational Average: 6.24383561644; Median: 6.28219178082
Gaming Average: 5.67397260274; Median: 5.08356164384
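The gaps between each category and the overall top 1000 can be worked out directly from the figures printed in this notebook. The sketch below only copies those printed values (in years) and converts the differences to months; the helper name `gap_in_months` is ours, introduced for illustration.

```python
# Figures copied from the printed outputs in this notebook (in years).
overall = {"average": 5.85753424658, "median": 5.71369863014}
education = {"average": 6.24383561644, "median": 6.28219178082}
gaming = {"average": 5.67397260274, "median": 5.08356164384}

def gap_in_months(category, baseline, key):
    """Difference between a category and the baseline, converted from years to months."""
    return (category[key] - baseline[key]) * 12

for name, stats in (("Educational", education), ("Gaming", gaming)):
    for key in ("average", "median"):
        print("%s %s gap: %+.1f months" % (name, key, gap_in_months(stats, overall, key)))
```

A positive gap means the category is older than the overall figure, a negative gap means it is younger.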

Code to generate dataset

This initial code scrapes the list of top users from a given VidStatsX URL.


In [1]:
from bs4 import BeautifulSoup
import requests
import arrow

def get_users(url="http://vidstatsx.com/youtube-top-200-most-subscribed-channels"):
    """Get the users from a VidStatsX page."""
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly to avoid the bs4 warning
    return [x.get('id') for x in soup.find_all("td") if x.get('id') is not None]
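To see what the list comprehension in `get_users` is doing without hitting the network, here is a minimal sketch of the same idea (collect the `id` of every `<td>` that has one) using Python 3's standard-library `HTMLParser` instead of BeautifulSoup. The class name `TdIdCollector` and the sample markup are made up for illustration; the real VidStatsX page layout is not shown in this notebook.

```python
from html.parser import HTMLParser  # Python 3 standard library

class TdIdCollector(HTMLParser):
    """Collect the id attribute of every <td> that has one."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "td":
            td_id = dict(attrs).get("id")
            if td_id is not None:
                self.ids.append(td_id)

# Hypothetical markup standing in for a VidStatsX table row.
sample = ('<table><tr><td id="exampleUserA">1</td><td>no id</td>'
          '<td id="exampleUserB">2</td></tr></table>')

collector = TdIdCollector()
collector.feed(sample)
print(collector.ids)  # ['exampleUserA', 'exampleUserB']
```

Cells without an `id` are skipped, which mirrors the `is not None` filter in `get_users`.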

Now that we have a function to get the users, we can ask YouTube for information about them, including the start dates. From there, we can convert those dates into ages using a third function.


In [2]:
def get_start_dates(users):
    """Ask the YouTube Data API for each user's channel creation date."""
    request_url = "https://www.googleapis.com/youtube/v3/channels?part=snippet&forUsername="
    key = "&key=AIzaSyCZx95H8pP-csC_6G8mF5tv-kW_U20HJKs"
    # One request per user; parse each JSON body once
    results = [requests.get(request_url + x + key).json() for x in users]

    # Skip users the API returns no channel for (empty 'items')
    return [x['items'][0]['snippet'].get('publishedAt') for x in results if len(x['items']) > 0]

def get_ages(users):
    start_dates = get_start_dates(users)
    
    return [int((arrow.now() - arrow.get(x)).days) for x in start_dates]
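The two steps in this cell (drop responses with no channel, then turn each start date into an age in days) can be tried offline with made-up data. The sketch below swaps `arrow` for the standard library's `datetime` and uses a fixed reference date so the result is repeatable. The response bodies, the helper names `start_dates` and `age_in_days`, and the timestamp shape (`YYYY-MM-DDTHH:MM:SS` followed by a suffix) are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical response bodies, shaped like the fields the cell above reads.
responses = [
    {"items": [{"snippet": {"publishedAt": "2010-01-01T00:00:00.000Z"}}]},
    {"items": []},  # e.g. a name the API does not resolve to a channel
]

def start_dates(bodies):
    """Keep publishedAt for every body that has at least one item."""
    return [b["items"][0]["snippet"].get("publishedAt")
            for b in bodies if len(b["items"]) > 0]

def age_in_days(published_at, now):
    """Whole days between a timestamp string and a reference datetime."""
    # Only the first 19 characters (YYYY-MM-DDTHH:MM:SS) are parsed;
    # both values are treated as naive UTC times.
    started = datetime.strptime(published_at[:19], "%Y-%m-%dT%H:%M:%S")
    return (now - started).days

reference = datetime(2015, 1, 1)  # fixed "now" so the output does not change
ages = [age_in_days(d, reference) for d in start_dates(responses)]
print(ages)  # [1826]
```

The empty response is silently dropped, which is why the final dataset can end up with fewer entries than the scraped list.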

With all of our functions written, we can use them to find the ages of the top 1000 channels.

Note that after the top 200, the pages start where the last one left off, so the top 500 most subscribed channels page includes only channels from 201 to 500.


In [3]:
top_users_ages = get_ages(get_users() +
             get_users("http://vidstatsx.com/youtube-top-500-most-subscribed-channels") + 
             get_users("http://vidstatsx.com/youtube-top-750-most-subscribed-channels") +
             get_users("http://vidstatsx.com/youtube-top-1000-most-subscribed-channels"))


VidStatsX also includes charts by category, so we can get the results by category, too.


In [4]:
edu_ages = get_ages(get_users("http://vidstatsx.com/youtube-top-100-most-subscribed-education-channels"))
gaming_ages = get_ages(get_users("http://vidstatsx.com/youtube-top-100-most-subscribed-games-gaming-channels"))
